For our final project we incorporate multiple data sets from the official New York State census database (focusing on crime rates and student obesity) and produce visualizations and data operations using the skills we have learned this semester.
Authors: Bruno De Hoyos (bd8739), Estevan Garza (eg22453)
We begin by importing some useful libraries:
require(bitops)
## Loading required package: bitops
require(stats)
require(ggplot2)
## Loading required package: ggplot2
require(grid)
## Loading required package: grid
require(plyr)
## Loading required package: plyr
require(reshape2)
## Loading required package: reshape2
require(RCurl)
## Loading required package: RCurl
require(gplots)
## Loading required package: gplots
##
## Attaching package: 'gplots'
##
## The following object is masked from 'package:stats':
##
## lowess
require(tidyr)
## Loading required package: tidyr
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(jsonlite)
## Loading required package: jsonlite
##
## Attaching package: 'jsonlite'
##
## The following object is masked from 'package:utils':
##
## View
require("scatterplot3d")
## Loading required package: scatterplot3d
require("rgl")
## Loading required package: rgl
require("RColorBrewer")
## Loading required package: RColorBrewer
We now begin building our data frames. We build each data frame with the data.frame() function, passing it a URL to an Oracle Cloud server along with an appropriate SQL query. In this case we query the tables “TYPE_CRIMES_BY_COUNTY”, “SENTENCES_TO_PROBATION_BY_YEAR”, “ADULT_ARRESTS_BY_COUNTY”, and “weight3”, which we have already imported into the Oracle database, and name the resulting data frames ‘dft’, ‘dfs’, ‘dfr’, and ‘student_weight’ respectively.
source("../01 Data/CrimeDataFrames.R", echo=TRUE, max.deparse.length=1e3)
##
## > dft <- data.frame(fromJSON(getURL(URLencode("129.152.144.84:5001/rest/native/?query=\"select * from TYPE_CRIMES_BY_COUNTY\""),
## + httpheader = c(DB = "jdbc:oracle:thin:@129.152.144.84:1521:ORCL",
## + USER = "C##cs329e_eg22453", PASS = "orcl_eg22453", MODE = "native_mode",
## + MODEL = "model", returnDimensions = "False", returnFor = "JSON"),
## + verbose = TRUE)))
##
## > dfs <- data.frame(fromJSON(getURL(URLencode("129.152.144.84:5001/rest/native/?query=\"select * from SENTENCES_TO_PROBATION_BY_YEAR\""),
## + httpheader = c(DB = "jdbc:oracle:thin:@129.152.144.84:1521:ORCL",
## + USER = "C##cs329e_eg22453", PASS = "orcl_eg22453", MODE = "native_mode",
## + MODEL = "model", returnDimensions = "False", returnFor = "JSON"),
## + verbose = TRUE)))
##
## > dfr <- data.frame(fromJSON(getURL(URLencode("129.152.144.84:5001/rest/native/?query=\"select * from ADULT_ARRESTS_BY_COUNTY\""),
## + httpheader = c(DB = "jdbc:oracle:thin:@129.152.144.84:1521:ORCL",
## + USER = "C##cs329e_eg22453", PASS = "orcl_eg22453", MODE = "native_mode",
## + MODEL = "model", returnDimensions = "False", returnFor = "JSON"),
## + verbose = TRUE)))
##
## > dfr2 <- dfr %>% filter(COUNTY != "Unknown-NonNYS",
## + COUNTY != "Unknown NYC county")
##
## > dfs2 <- dfs %>% filter(CONVICTION == "Felony")
source("../01 Data/StudentWeightDataFrame.R", echo=TRUE, max.deparse.length=1e3)
##
## > student_weight <- data.frame(eval(parse(text = substring(getURL(URLencode("http://129.152.144.84:5001/rest/native/?query=\"select * from weight3\""),
## + httpheader = c(DB = "jdbc:oracle:thin:@129.152.144.84:1521:ORCL",
## + USER = "C##cs329e_bd8739", PASS = "orcl_bd8739", MODE = "native_mode",
## + MODEL = "model", returnFor = "R", returnDimensions = "False"),
## + verbose = TRUE), 1, 2^31 - 1))))
##
## > student_weight[, 1] <- as.numeric(as.character(student_weight[,
## + 1]))
##
## > student_weight[, 6] <- as.numeric(as.character(student_weight[,
## + 6]))
##
## > student_weight[, 7] <- as.numeric(as.character(student_weight[,
## + 7]))
##
## > student_weight[, 8] <- as.numeric(as.character(student_weight[,
## + 8]))
##
## > student_weight[, 9] <- as.numeric(as.character(student_weight[,
## + 9]))
##
## > student_weight[, 10] <- as.numeric(as.character(student_weight[,
## + 10]))
##
## > student_weight[, 11] <- as.numeric(as.character(student_weight[,
## + 11]))
##
## > student_weight[, 17] <- as.numeric(as.character(student_weight[,
## + 17]))
##
## > student_weight[, 18] <- as.numeric(as.character(student_weight[,
## + 18]))
##
## > student_weight[, 19] <- as.numeric(as.character(student_weight[,
## + 19]))
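The ten repeated as.numeric(as.character(...)) assignments above can be collapsed into a short helper. This is only a sketch, not the script's actual code; the column indices mirror the ones used above:

```r
# Convert the given factor columns to numeric in one pass (sketch).
to_numeric <- function(df, cols) {
  df[cols] <- lapply(df[cols], function(x) as.numeric(as.character(x)))
  df
}
# Equivalent to the ten assignments above:
# student_weight <- to_numeric(student_weight, c(1, 6:11, 17:19))
```

The as.character() step matters: calling as.numeric() directly on a factor would return its internal integer codes rather than the printed values.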
We then display the ‘head’ of each data frame to ensure all the data was loaded properly.
source("../01 Data/DataFramShow.R", echo=TRUE, max.deparse.length=1e3)
##
## > tbl_df(dft)
## Source: local data frame [1,488 x 11]
##
## COUNTY YEAR POPULATION INDEX_COUNT INDEX_RATE VIOLENT_COUNT
## 1 Erie 2009 905273 32656 3607.3 4915
## 2 Essex 2009 36495 519 1422.1 49
## 3 Franklin 2009 50448 937 1857.4 65
## 4 Fulton 2009 54912 1522 2771.7 112
## 5 Genesee 2009 57420 1497 2607.1 92
## 6 Greene 2009 49071 853 1738.3 102
## 7 Hamilton 2009 4980 51 1024.1 2
## 8 Herkimer 2009 61943 1369 2210.1 168
## 9 Jefferson 2009 118766 2701 2274.2 224
## 10 Kings 2009 2567968 55816 2173.5 15803
## .. ... ... ... ... ... ...
## Variables not shown: VIOLENT_RATE (dbl), PROPERTY_COUNT (int),
## PROPERTY_RATE (dbl), FIREARM_COUNT (fctr), FIREARM_RATE (fctr)
##
## > tbl_df(dfs2)
## Source: local data frame [2,728 x 9]
##
## COUNTY YEAR CONVICTION VIOLENT_FELONY_OFFENSES DRUGS DWI PROPERTY
## 1 Orleans 2011 Felony 0 12 2 6
## 2 Oswego 2011 Felony 3 6 25 31
## 3 Otsego 2011 Felony 0 2 14 9
## 4 Putnam 2011 Felony 4 3 20 8
## 5 Queens 2011 Felony 256 64 67 108
## 6 Rensselaer 2011 Felony 10 14 34 9
## 7 Richmond 2011 Felony 38 31 3 22
## 8 Rockland 2011 Felony 15 40 40 48
## 9 Saratoga 2011 Felony 10 15 74 18
## 10 Schenectady 2011 Felony 14 13 32 18
## .. ... ... ... ... ... ... ...
## Variables not shown: OTHER (int), TOTAL (int)
##
## > tbl_df(dfr2)
## Source: local data frame [2,728 x 13]
##
## COUNTY YEAR TOTAL FELONY_TOTAL DRUG_FELONY VIOLENT_FELONY
## 1 Schuyler 2006 380 81 12 11
## 2 Seneca 2006 760 187 22 31
## 3 Steuben 2006 2156 625 96 134
## 4 St. Lawrence 2006 2843 768 125 141
## 5 Suffolk 2006 26353 7523 1424 1770
## 6 Sullivan 2006 2161 749 141 179
## 7 Tioga 2006 813 222 30 38
## 8 Tompkins 2006 1578 344 46 97
## 9 Ulster 2006 5024 1249 230 234
## 10 Warren 2006 2325 616 48 121
## .. ... ... ... ... ... ...
## Variables not shown: DWI_FELONY (int), OTHER_FELONY (int),
## MISDEMEANOR_TOTAL (int), DRUG_MISD (int), DWI_MISD (int), PROPERTY_MISD
## (int), OTHER_MISD (int)
##
## > tbl_df(student_weight)
## Source: local data frame [3,270 x 19]
##
## LOCATION_CODE COUNTY AREA_NAME
## 1 30701 BROOME CHENANGO VALLEY CENTRAL SCHOOL DISTRICT
## 2 31701 BROOME WINDSOR CENTRAL SCHOOL DISTRICT
## 3 31701 BROOME WINDSOR CENTRAL SCHOOL DISTRICT
## 4 31701 BROOME WINDSOR CENTRAL SCHOOL DISTRICT
## 5 40302 CATTARAUGUS ALLEGANY-LIMESTONE CSD
## 6 40302 CATTARAUGUS ALLEGANY-LIMESTONE CSD
## 7 40302 CATTARAUGUS ALLEGANY-LIMESTONE CSD
## 8 40901 CATTARAUGUS ELLICOTTVILLE CENTRAL SCHOOL DISTRICT
## 9 40901 CATTARAUGUS ELLICOTTVILLE CENTRAL SCHOOL DISTRICT
## 10 40901 CATTARAUGUS ELLICOTTVILLE CENTRAL SCHOOL DISTRICT
## .. ... ... ...
## Variables not shown: REGION (fctr), SCHOOL_YEARS (fctr), NO_OVERWEIGHT
## (dbl), PCT_OVERWEIGHT (dbl), NO_OBESE (dbl), PCT_OBESE (dbl),
## NO_OVERWEIGHT_OR_OBESE (dbl), PCT_OVERWEIGHT_OR_OBESE (dbl), GRADE_LEVEL
## (fctr), AREA_TYPE (fctr), STREET_ADDRESS (fctr), CITY (fctr), STATE
## (fctr), ZIP_CODE (dbl), LAT (dbl), LON (dbl)
Now that we have the three crime tables loaded into RStudio, we can join them. We join all three tables by first joining any two of them on the common ‘COUNTY’ and ‘YEAR’ columns, placing the result in a temporary data frame, and then joining that temporary data frame with the third table. We perform this operation twice, once as a full join and once as an inner join. We also use the complete.cases() function on the final tables to remove any rows containing NA values, and convert several columns to numeric with as.numeric() to facilitate plotting.
source("../01 Data/CrimeDataFrameJoin.R", echo=TRUE, max.deparse.length=1e3)
##
## > temp <- full_join(dfs2, dfr2, by = c("COUNTY", "YEAR"))
##
## > full <- full_join(temp, dft, by = c("COUNTY", "YEAR"))
##
## > cleantbl <- full[complete.cases(full), ]
##
## > cleantbl[, 4] <- as.numeric(as.character(cleantbl[,
## + 4]))
##
## > cleantbl[, 5] <- as.numeric(as.character(cleantbl[,
## + 5]))
##
## > cleantbl[, 6] <- as.numeric(as.character(cleantbl[,
## + 6]))
##
## > cleantbl[, 7] <- as.numeric(as.character(cleantbl[,
## + 7]))
##
## > cleantbl[, 8] <- as.numeric(as.character(cleantbl[,
## + 8]))
##
## > cleantbl[, 9] <- as.numeric(as.character(cleantbl[,
## + 9]))
##
## > cleantbl[, 10] <- as.numeric(as.character(cleantbl[,
## + 10]))
##
## > cleantbl[, 11] <- as.numeric(as.character(cleantbl[,
## + 11]))
##
## > cleantbl[, 12] <- as.numeric(as.character(cleantbl[,
## + 12]))
##
## > cleantbl[, 13] <- as.numeric(as.character(cleantbl[,
## + 13]))
##
## > cleantbl[, 14] <- as.numeric(as.character(cleantbl[,
## + 14]))
##
## > cleantbl[, 15] <- as.numeric(as.character(cleantbl[,
## + 15]))
##
## > cleantbl[, 16] <- as.numeric(as.character(cleantbl[,
## + 16]))
##
## > cleantbl[, 17] <- as.numeric(as.character(cleantbl[,
## + 17]))
##
## > cleantbl[, 18] <- as.numeric(as.character(cleantbl[,
## + 18]))
##
## > cleantbl[, 19] <- as.numeric(as.character(cleantbl[,
## + 19]))
##
## > cleantbl[, 20] <- as.numeric(as.character(cleantbl[,
## + 20]))
##
## > cleantbl[, 21] <- as.numeric(as.character(cleantbl[,
## + 21]))
##
## > cleantbl[, 22] <- as.numeric(as.character(cleantbl[,
## + 22]))
##
## > cleantbl[, 23] <- as.numeric(as.character(cleantbl[,
## + 23]))
##
## > cleantbl[, 24] <- as.numeric(as.character(cleantbl[,
## + 24]))
##
## > cleantbl[, 25] <- as.numeric(as.character(cleantbl[,
## + 25]))
##
## > cleantbl[, 26] <- as.numeric(as.character(cleantbl[,
## + 26]))
##
## > cleantbl[, 27] <- as.numeric(as.character(cleantbl[,
## + 27]))
##
## > cleantbl[, 28] <- as.numeric(as.character(cleantbl[,
## + 28]))
##
## > cleantbl[, 29] <- as.numeric(as.character(cleantbl[,
## + 29]))
##
## > cleantbl <- cleantbl[complete.cases(cleantbl), ]
##
## > head(cleantbl)
## COUNTY YEAR CONVICTION VIOLENT_FELONY_OFFENSES DRUGS DWI PROPERTY
## 1 Orleans 2011 Felony 0 12 2 6
## 2 Oswego 2011 Felony 3 6 25 31
## 3 Otsego 2011 Felony 0 2 14 9
## 4 Putnam 2011 Felony 4 3 20 8
## 6 Rensselaer 2011 Felony 10 14 34 9
## 8 Rockland 2011 Felony 15 40 40 48
## OTHER TOTAL.x TOTAL.y FELONY_TOTAL DRUG_FELONY VIOLENT_FELONY DWI_FELONY
## 1 5 25 858 249 47 65 22
## 2 10 75 3180 893 94 208 91
## 3 1 26 1546 427 76 82 44
## 4 7 42 1560 372 98 66 43
## 6 26 93 3485 1094 167 255 79
## 8 46 189 4413 1420 274 367 111
## OTHER_FELONY MISDEMEANOR_TOTAL DRUG_MISD DWI_MISD PROPERTY_MISD
## 1 115 609 42 107 214
## 2 500 2287 186 472 728
## 3 225 1119 89 291 335
## 4 165 1188 189 451 178
## 6 593 2391 210 425 782
## 8 668 2993 252 630 919
## OTHER_MISD POPULATION INDEX_COUNT INDEX_RATE VIOLENT_COUNT VIOLENT_RATE
## 1 246 43076 1004 2330.8 82 190.4
## 2 901 122658 3206 2613.8 181 147.6
## 3 404 62539 1144 1829.3 119 190.3
## 4 370 100158 860 858.6 66 65.9
## 6 974 160146 4682 2923.6 485 302.8
## 8 1192 313088 4564 1457.7 553 176.6
## PROPERTY_COUNT PROPERTY_RATE FIREARM_COUNT FIREARM_RATE
## 1 922 2140.4 5 11.6
## 2 3025 2466.2 17 13.9
## 3 1025 1639.0 18 28.8
## 4 794 792.7 1 1.0
## 6 4197 2620.7 112 69.9
## 8 4011 1281.1 44 14.1
##
## > inner_temp <- inner_join(dfs2, dfr2, by = c("COUNTY",
## + "YEAR"))
##
## > inner <- inner_join(inner_temp, dft, by = c("COUNTY",
## + "YEAR"))
##
## > cleaninner <- inner[complete.cases(inner), ]
##
## > cleaninner[, 4] <- as.numeric(as.character(cleaninner[,
## + 4]))
##
## > cleaninner[, 5] <- as.numeric(as.character(cleaninner[,
## + 5]))
##
## > cleaninner[, 6] <- as.numeric(as.character(cleaninner[,
## + 6]))
##
## > cleaninner[, 7] <- as.numeric(as.character(cleaninner[,
## + 7]))
##
## > cleaninner[, 8] <- as.numeric(as.character(cleaninner[,
## + 8]))
##
## > cleaninner[, 9] <- as.numeric(as.character(cleaninner[,
## + 9]))
##
## > cleaninner[, 10] <- as.numeric(as.character(cleaninner[,
## + 10]))
##
## > cleaninner[, 11] <- as.numeric(as.character(cleaninner[,
## + 11]))
##
## > cleaninner[, 12] <- as.numeric(as.character(cleaninner[,
## + 12]))
##
## > cleaninner[, 13] <- as.numeric(as.character(cleaninner[,
## + 13]))
##
## > cleaninner[, 14] <- as.numeric(as.character(cleaninner[,
## + 14]))
##
## > cleaninner[, 15] <- as.numeric(as.character(cleaninner[,
## + 15]))
##
## > cleaninner[, 16] <- as.numeric(as.character(cleaninner[,
## + 16]))
##
## > cleaninner[, 17] <- as.numeric(as.character(cleaninner[,
## + 17]))
##
## > cleaninner[, 18] <- as.numeric(as.character(cleaninner[,
## + 18]))
##
## > cleaninner[, 19] <- as.numeric(as.character(cleaninner[,
## + 19]))
##
## > cleaninner[, 20] <- as.numeric(as.character(cleaninner[,
## + 20]))
##
## > cleaninner[, 21] <- as.numeric(as.character(cleaninner[,
## + 21]))
##
## > cleaninner[, 22] <- as.numeric(as.character(cleaninner[,
## + 22]))
##
## > cleaninner[, 23] <- as.numeric(as.character(cleaninner[,
## + 23]))
##
## > cleaninner[, 24] <- as.numeric(as.character(cleaninner[,
## + 24]))
##
## > cleaninner[, 25] <- as.numeric(as.character(cleaninner[,
## + 25]))
##
## > cleaninner[, 26] <- as.numeric(as.character(cleaninner[,
## + 26]))
##
## > cleaninner[, 27] <- as.numeric(as.character(cleaninner[,
## + 27]))
##
## > cleaninner[, 28] <- as.numeric(as.character(cleaninner[,
## + 28]))
##
## > cleaninner[, 29] <- as.numeric(as.character(cleaninner[,
## + 29]))
##
## > cleaninner <- cleaninner[complete.cases(cleaninner), ]
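The difference between the two joins above shows up clearly on toy data. Base R's merge() has the same semantics as dplyr's full_join()/inner_join(); the county names and columns below are purely illustrative:

```r
# Two tiny county-year tables with one overlapping key (Bronx, 2010).
x <- data.frame(COUNTY = c("Albany", "Bronx"), YEAR = 2010, arrests   = c(10, 20))
y <- data.frame(COUNTY = c("Bronx", "Kings"), YEAR = 2010, sentences = c(5, 8))

full  <- merge(x, y, by = c("COUNTY", "YEAR"), all = TRUE)  # full join: 3 rows, NAs for non-matches
inner <- merge(x, y, by = c("COUNTY", "YEAR"))              # inner join: 1 row (Bronx only)
```

This is why the full-joined table needs the complete.cases() pass: the unmatched county-year rows carry NAs that would otherwise break the plots.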
The next step in the project is data wrangling. The first wrangling procedure uses the cumulative distribution function to keep only the rows at or below the 50th percentile of DWI misdemeanors, filtering out the upper half of the data.
source("../02 Data Wrangling/Wrangle1.R", echo = TRUE, max.deparse.length=1e3)
##
## > wr1 <- cleantbl %>% mutate(misd_pct = cume_dist(DWI_MISD)) %>%
## + filter(misd_pct <= 0.5)
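cume_dist() returns, for each value, the fraction of values less than or equal to it, which is why the <= 0.5 filter keeps the lower half. A base-R sketch of the same computation:

```r
# dplyr's cume_dist(x) is equivalent to rank(x, ties.method = "max") / length(x).
cume_dist_base <- function(x) rank(x, ties.method = "max") / length(x)

cume_dist_base(c(10, 20, 30, 40))  # 0.25 0.50 0.75 1.00
```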
The second wrangling procedure makes use of the mutate function to add a new column to the joined table representing the sum of the property, violent, and firearm crime rates in each New York county.
source("../02 Data Wrangling/Wrangle2.R", echo = TRUE, max.deparse.length=1e3)
##
## > wr2 <- cleantbl %>% mutate(prop_vio_fire = PROPERTY_RATE +
## + VIOLENT_RATE + FIREARM_RATE)
The next and most complex wrangling procedure selects various crime-related columns (including different variations of felony and misdemeanor crimes), groups them by year, and summarises the mean of each crime column. We use the melt function introduced at the beginning of the semester to stack the respective columns on top of each other, all sharing the YEAR column.
source("../02 Data Wrangling/Wrangle3.R", echo = TRUE, max.deparse.length=1e3)
##
## > wr3 <- cleantbl %>% select(YEAR, DRUGS, DWI, PROPERTY,
## + DRUG_FELONY, VIOLENT_FELONY, DWI_FELONY, DRUG_MISD, DWI_MISD,
## + PROPERTY_MISD, VIOLENT_COUNT, PROPERTY_COUNT, FIREARM_COUNT) %>%
## + group_by(YEAR) %>% summarise(`Drug Probation` = mean(DRUGS),
## + `DWI Probation` = mean(DWI), `Property Probation` = mean(PROPERTY),
## + `Drug Felonies` = mean(DRUG_FELONY), `Violent Felonies` = mean(VIOLENT_FELONY),
## + `DWI Felonies` = mean(DWI_FELONY), `Drug Misdemeanors` = mean(DRUG_MISD),
## + `DWI Misdemeanors` = mean(DWI_MISD), `Property Misdemeanors` = mean(PROPERTY_MISD),
## + `Violent Crime Count` = mean(VIOLENT_COUNT), `Property Crime Count` = mean(PROPERTY_COUNT),
## + `Firearm Crime Count` = mean(FIREARM_COUNT)) %>% melt(.,
## + id.vars = "YEAR", measure.vars = c("Drug Probation", "DWI Probation",
## + "Property Probation", "Drug Felonies", "Violent Felonies",
## + "DWI Felonies", "Drug Misdemeanors", "DWI Misdemeanors",
## + "Property M ..." ... [TRUNCATED]
Now we implement the first of our data wrangling steps for the student obesity table. Using the pipe operator together with the filter, mutate, and cumulative distribution functions, we produce a new data frame with a new variable, “ow_ob_pct”. This variable ranks each individual school district (excluding the DISTRICT TOTAL rows, which are filtered out with the != operator) by percentile of overweight or obese students, and we keep only districts at or below the 90th percentile.
source("../02 Data Wrangling/student_wrangle1.R", echo = TRUE, max.deparse.length=1e3)
##
## > st1 <- student_weight %>% mutate(ow_ob_pct = cume_dist(NO_OVERWEIGHT_OR_OBESE)) %>%
## + filter(ow_ob_pct <= 0.9, GRADE_LEVEL != "DISTRICT TOTAL")
##
## > head(st1)
## LOCATION_CODE COUNTY AREA_NAME
## 1 30701 BROOME CHENANGO VALLEY CENTRAL SCHOOL DISTRICT
## 2 31701 BROOME WINDSOR CENTRAL SCHOOL DISTRICT
## 3 31701 BROOME WINDSOR CENTRAL SCHOOL DISTRICT
## 4 40302 CATTARAUGUS ALLEGANY-LIMESTONE CSD
## 5 40302 CATTARAUGUS ALLEGANY-LIMESTONE CSD
## 6 40901 CATTARAUGUS ELLICOTTVILLE CENTRAL SCHOOL DISTRICT
## REGION SCHOOL_YEARS NO_OVERWEIGHT PCT_OVERWEIGHT NO_OBESE
## 1 NEW YORK - PENN 2012-2013 42 0.170 58
## 2 NEW YORK - PENN 2012-2013 54 0.139 65
## 3 NEW YORK - PENN 2012-2013 44 0.175 61
## 4 WESTERN NEW YORK 2012-2013 69 0.174 56
## 5 WESTERN NEW YORK 2012-2013 25 0.153 35
## 6 WESTERN NEW YORK 2012-2013 30 0.203 23
## PCT_OBESE NO_OVERWEIGHT_OR_OBESE PCT_OVERWEIGHT_OR_OBESE GRADE_LEVEL
## 1 0.235 100 0.41 MIDDLE/HIGH
## 2 0.167 119 0.31 ELEMENTARY
## 3 0.242 105 0.42 MIDDLE/HIGH
## 4 0.141 125 0.32 ELEMENTARY
## 5 0.215 60 0.37 MIDDLE/HIGH
## 6 0.155 53 0.36 ELEMENTARY
## AREA_TYPE STREET_ADDRESS CITY STATE ZIP_CODE
## 1 SCHOOL DISTRICT 221 CHENANGO BRIDGE RD BINGHAMTON NY 13901
## 2 SCHOOL DISTRICT 1191 NY RT 79 WINDSOR NY 13865
## 3 SCHOOL DISTRICT 1191 NY RT 79 WINDSOR NY 13865
## 4 SCHOOL DISTRICT 3131 FIVE MILE RD ALLEGANY NY 14706
## 5 SCHOOL DISTRICT 3131 FIVE MILE RD ALLEGANY NY 14706
## 6 SCHOOL DISTRICT 5873 RT 219 ELLICOTTVILLE NY 14731
## LAT LON ow_ob_pct
## 1 42.16298 -75.87401 0.4615877
## 2 42.06670 -75.64087 0.5272087
## 3 42.06670 -75.64087 0.4801536
## 4 42.13191 -78.50214 0.5473752
## 5 42.13191 -78.50214 0.2810499
## 6 42.25484 -78.65117 0.2362356
The next step of this data wrangling process expands on the first by filtering the table further: in addition to excluding the top decile of districts by number of overweight or obese students, this data frame also excludes the STATEWIDE region.
source("../02 Data Wrangling/student_wrangle2.R", echo = TRUE, max.deparse.length=1e3)
##
## > st2 <- student_weight %>% mutate(ow_ob_pct = cume_dist(NO_OVERWEIGHT_OR_OBESE)) %>%
## + filter(REGION != "STATEWIDE (EXCLUDING NYC)", GRADE_LEVEL !=
## + "DISTRICT TOTAL", ow_ob_pct <= 0.9)
##
## > head(st2)
## LOCATION_CODE COUNTY AREA_NAME
## 1 30701 BROOME CHENANGO VALLEY CENTRAL SCHOOL DISTRICT
## 2 31701 BROOME WINDSOR CENTRAL SCHOOL DISTRICT
## 3 31701 BROOME WINDSOR CENTRAL SCHOOL DISTRICT
## 4 40302 CATTARAUGUS ALLEGANY-LIMESTONE CSD
## 5 40302 CATTARAUGUS ALLEGANY-LIMESTONE CSD
## 6 40901 CATTARAUGUS ELLICOTTVILLE CENTRAL SCHOOL DISTRICT
## REGION SCHOOL_YEARS NO_OVERWEIGHT PCT_OVERWEIGHT NO_OBESE
## 1 NEW YORK - PENN 2012-2013 42 0.170 58
## 2 NEW YORK - PENN 2012-2013 54 0.139 65
## 3 NEW YORK - PENN 2012-2013 44 0.175 61
## 4 WESTERN NEW YORK 2012-2013 69 0.174 56
## 5 WESTERN NEW YORK 2012-2013 25 0.153 35
## 6 WESTERN NEW YORK 2012-2013 30 0.203 23
## PCT_OBESE NO_OVERWEIGHT_OR_OBESE PCT_OVERWEIGHT_OR_OBESE GRADE_LEVEL
## 1 0.235 100 0.41 MIDDLE/HIGH
## 2 0.167 119 0.31 ELEMENTARY
## 3 0.242 105 0.42 MIDDLE/HIGH
## 4 0.141 125 0.32 ELEMENTARY
## 5 0.215 60 0.37 MIDDLE/HIGH
## 6 0.155 53 0.36 ELEMENTARY
## AREA_TYPE STREET_ADDRESS CITY STATE ZIP_CODE
## 1 SCHOOL DISTRICT 221 CHENANGO BRIDGE RD BINGHAMTON NY 13901
## 2 SCHOOL DISTRICT 1191 NY RT 79 WINDSOR NY 13865
## 3 SCHOOL DISTRICT 1191 NY RT 79 WINDSOR NY 13865
## 4 SCHOOL DISTRICT 3131 FIVE MILE RD ALLEGANY NY 14706
## 5 SCHOOL DISTRICT 3131 FIVE MILE RD ALLEGANY NY 14706
## 6 SCHOOL DISTRICT 5873 RT 219 ELLICOTTVILLE NY 14731
## LAT LON ow_ob_pct
## 1 42.16298 -75.87401 0.4615877
## 2 42.06670 -75.64087 0.5272087
## 3 42.06670 -75.64087 0.4801536
## 4 42.13191 -78.50214 0.5473752
## 5 42.13191 -78.50214 0.2810499
## 6 42.25484 -78.65117 0.2362356
The final wrangling procedure involves filtering out the top decile of ‘Percent Overweight or Obese’, the region ‘Statewide’, and the grade level ‘District Total’. Then we use a group_by function that groups ‘City’ and ‘Region’ together and obtains the mean value of ‘Percent Overweight or Obese’ within each grouping.
source("../02 Data Wrangling/student_wrangle3.R", echo = TRUE, max.deparse.length=1e3)
##
## > st3 <- student_weight %>% mutate(pct_cume_dist = cume_dist(PCT_OVERWEIGHT_OR_OBESE)) %>%
## + filter(pct_cume_dist <= 0.9, REGION != "STATEWIDE (EXCLUDING NYC)",
## + GRADE_LEVEL != "DISTRICT TOTAL") %>% group_by(CITY, REGION) %>%
## + mutate(mean = mean(PCT_OVERWEIGHT_OR_OBESE))
##
## > head(st3)
## Source: local data frame [6 x 21]
## Groups: CITY, REGION
##
## LOCATION_CODE COUNTY AREA_NAME
## 1 30701 BROOME CHENANGO VALLEY CENTRAL SCHOOL DISTRICT
## 2 31701 BROOME WINDSOR CENTRAL SCHOOL DISTRICT
## 3 31701 BROOME WINDSOR CENTRAL SCHOOL DISTRICT
## 4 40302 CATTARAUGUS ALLEGANY-LIMESTONE CSD
## 5 40302 CATTARAUGUS ALLEGANY-LIMESTONE CSD
## 6 40901 CATTARAUGUS ELLICOTTVILLE CENTRAL SCHOOL DISTRICT
## Variables not shown: REGION (fctr), SCHOOL_YEARS (fctr), NO_OVERWEIGHT
## (dbl), PCT_OVERWEIGHT (dbl), NO_OBESE (dbl), PCT_OBESE (dbl),
## NO_OVERWEIGHT_OR_OBESE (dbl), PCT_OVERWEIGHT_OR_OBESE (dbl), GRADE_LEVEL
## (fctr), AREA_TYPE (fctr), STREET_ADDRESS (fctr), CITY (fctr), STATE
## (fctr), ZIP_CODE (dbl), LAT (dbl), LON (dbl), pct_cume_dist (dbl), mean
## (dbl)
Now we start plotting our data using the wrangling procedures performed above. This first graph shows the number of DWI misdemeanor arrests against county population, facet-wrapped and color-coded by year. We then apply a linear fit to each facet, showing the general trend of DWI misdemeanors increasing with county population in every year.
source("../03 Visualizations/Plot1.R", echo = TRUE, max.deparse.length=1e3)
##
## > wr1 %>% ggplot(aes(x = DWI_MISD, y = POPULATION, color = factor(YEAR))) +
## + geom_point() + stat_smooth(method = "lm", se = FALSE) + facet_wrap(~YEAR) +
## + ylim(c(0, 150000)) + theme(legend.position = "none", plot.title = element_text(size = 16,
## + face = "bold", vjust = 2, family = "Helvetica Neue Light")) +
## + labs(x = "Number of DWI Misdemeanor Arrests", y = "County Population") +
## + ggtitle("DWI Misdemeanor Arrests vs NY County Population by Year")
Our second plot makes use of the second wrangling procedure by plotting the density distribution of the sum of the property, violent, and firearm crime rates. Notice that this plot uses a semi-log scale along the x-axis to better show the distribution, which would otherwise be difficult to see on an ordinary scale. There is a distribution curve for each year of data, color-coded by year to show the time-dependent trend. Overall, the data appears roughly normally distributed each year. The mean of the summed crime rates decreases over time (on average, there was less crime as time progressed), and the variance of these distributions also decreases, as seen in the taller and narrower curves from recent years.
source("../03 Visualizations/Plot2.R", echo = TRUE, max.deparse.length=1e3)
##
## > wr2 %>% ggplot(aes(x = prop_vio_fire, color = YEAR,
## + group = YEAR)) + geom_density(adjust = 3) + scale_x_log10() +
## + scale_color_continuous(low = "#656565", high = "#FF6200",
## + name = "Year") + labs(x = "log( Combined Property, Firearm, and Violence Crime Rate )",
## + y = "Density") + theme(plot.title = element_text(size = 20,
## + face = "bold", vjust = 2, family = "Helvetica Neue Light")) +
## + ggtitle("New York Property, Firearm, and Violence\nCrime Rate Density Distributions Per Year")
The third plot accompanies the most complicated wrangling procedure. Here we plot all of the melted averages against year as points on one plot. We then add a fitted curve with a gray error band, as well as a dotted linear fit. As one can see, we make use of ggplot functions that make our plots more aesthetically pleasing, including graph titles, custom fonts, and color-coded regression lines. The linear fits help us see that, despite the fluctuations in the data, all of the crime types appear to decrease over time. This may be an indication that New York keeps getting safer every year, given the decreasing trends in almost every column.
source("../03 Visualizations/Plot3.R", echo = TRUE, max.deparse.length=1e3)
##
## > wr3 %>% ggplot(aes(x = YEAR, y = value, color = variable)) +
## + geom_point(alpha = 0.5) + stat_smooth(alpha = 0.15) + scale_y_log10() +
## + theme(panel.background = element_rect(fill = "white"), panel.grid.major = element_line(colour = "#B2B2B2",
## + size = 0.25), panel.grid.minor = element_line(colour = "#B2B2B2",
## + size = 0.25)) + stat_smooth(method = "lm", se = FALSE,
## + linetype = "dotted") + labs(x = "Year", y = "log( Mean Values )") +
## + theme(plot.title = element_text(size = 20, face = "bold",
## + vjust = 2, family = "Helvetica Neue Light")) + ggtitle("New York Crime Averages Per Year") +
## + scale_color_discrete(name = "Crime Type")
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
Next we plot a geometric point graph that shows the relationship between the number of obese children and number of overweight children in both elementary and middle/high schools.
source("../03 Visualizations/plot1student.R", echo = TRUE, max.deparse.length=1e3)
##
## > st1 %>% ggplot(aes(x = NO_OVERWEIGHT, y = NO_OBESE,
## + color = GRADE_LEVEL)) + geom_point() + facet_wrap(~REGION) +
## + stat_smooth(method = "lm", se = FALSE) + labs(x = "Number Overweight",
## + y = "Number Obese")
It’s interesting to see that elementary schools, although having fewer of each on average, have more “outliers” in the data set. This could be attributed to the limited availability of school sports in elementary schools, or to some other unobservable factor(s).
source("../03 Visualizations/plot3student.R", echo = TRUE, max.deparse.length=1e3)
##
## > st3 %>% ggplot(aes(x = LON, y = LAT, color = mean)) +
## + geom_point() + coord_equal() + scale_color_gradient(high = "red",
## + low = "green", name = "Mean percentage\nobese or overweight") +
## + facet_wrap(~SCHOOL_YEARS) + labs(x = "Longitude", y = "Latitude")
## Warning: Removed 14 rows containing missing values (geom_point).
Plotting the individual latitude and longitude coordinates on a point graph, we can see that the data set was collected from all over the state of New York in a relatively thorough study; the outline of each graph is the recognizable shape of the state itself. Furthermore, we can deduce two things from the graphs:
1. Although overweight and obese students were found in all parts of the state, higher percentages (> 0.4) were correlated with the northern half of New York.
2. Between 2012 and 2013, the state’s school districts saw a significant decrease in the number of overweight or obese students.
We begin by importing the data to Tableau. Since both of our data tables were previously stored in an Oracle database during an earlier project, we use Tableau’s “Connect to Data” wizard by selecting Oracle in the sidebar. Our data can be accessed by using the following credentials:
Server: 129.152.144.84
Service: ORCL.usuniversi01134.oraclecloud.internal
Port: 1521
Username: C##cs329e_eg22453
Password: orcl_eg22453
We repeat this process, under the same workbook, in order to import the two tables separately and avoid “joining” them.
Once our data is loaded into Tableau, we must add relationships to the data tables to ensure Tableau is able to use them properly. To do this, we go to Data >> Edit Relationships... and add the appropriate relationship that links the data together. For our project, our data was joined by the County and Year columns, as shown below.
We are now ready to create some interesting visualizations. For our first image, we wanted to see how drug and violence-related sentences correlated with drug felonies. We did this using a simple line graph, plotting year on the x-axis and the two measures of interest (Drug Sentences, Violent Felony Sentences) on the y-axis. In order to make use of the second data table, we adjusted the size and color of the graphs according to the number of drug arrests. It is clear from the graph colors and the upper graph that, with an increase in drug arrests and sentences, there was a definite spike in the number of violent sentences.
For our second graph, we wanted to see how the ratio of sentences to arrests varied for different types of crimes. In order to do this, we created several calculated fields for drug, DWI, and property crimes. These calculated fields were simply a ratio of the sum of the sentences to the sum of the arrests for each specific crime type, as shown below for the case of drugs.
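The Tableau calculated field amounts to a grouped ratio of sums. A base-R sketch of the same computation, with made-up column names and numbers:

```r
# Toy rows for two counties (illustrative values only).
d <- data.frame(COUNTY         = c("Bronx", "Bronx", "Kings"),
                drug_sentences = c(10, 20, 5),
                drug_arrests   = c(40, 60, 25))

# Sum within county, then divide: SUM(sentences) / SUM(arrests).
agg <- aggregate(cbind(drug_sentences, drug_arrests) ~ COUNTY, data = d, sum)
agg$ratio <- agg$drug_sentences / agg$drug_arrests  # Bronx: 30/100, Kings: 5/25
```

Note that the ratio of sums differs in general from the mean of per-row ratios, which is why the calculated field divides aggregates rather than averaging row-level divisions.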
With these three calculated ratios (drugs, DWI, and property), we then plotted the ratios (x-axis) against the total number of sentences (y-axis) across all crime types, to see whether the ratio would change. Surprisingly, we found that the ratio of sentences to arrests stayed fairly consistently between 0 and 0.5 for these three crime types, regardless of whether a data point came from a high sentence count (5K) or a low one (50). To enrich the view, we colored the points according to the sum of the total number of arrests.
For our third plot, we decided to visualize crime parameters for all of the different counties in New York. We made a bubble plot, where each county was colored and labeled distinctly. We then sized each bubble according to the total number of arrests in that county for all types of crime. Because each county was given a bubble for each year, coloring by county made it easy to see which counties had consistently high (or consistently low) numbers of arrests across years. Some counties, such as Bronx and Kings, had consistently high numbers of arrests, whereas other counties are so small that their labels did not even show up.
For our final plot, we made another calculated field, this time to calculate the ratio of the number of drug felonies to the number of drug misdemeanors for each county. The calculation was done very similarly to the ones under Blend 2, with a simple division of two sums. We plotted this ratio for all of the counties using a bar graph and colored the bars according to their value: bars over 1 are red, and everything else is green. In addition, we adjusted the width of the bars according to the total number of crime sentences in that county. Surprisingly, we found that the counties with red bars (that is, more felonies than misdemeanors) also tended to have thin bars (meaning a low number of crime sentences). This seems a little counterintuitive, but there are exceptions to every rule.